ANS‐SCMC: A matrix completion method based on adaptive neighbourhood similarity and sparse constraints for predicting microbe‐disease associations

Abstract Predicting associations between microbes and diseases with matrix completion methods can help improve treatment efficiency. However, the similarity measures used in existing methods are often affected by factors such as neighbourhood size, the choice of similarity metric, and the multiple parameters required for similarity fusion, which makes reliable similarity information difficult to obtain. Additionally, matrix completion is limited by the sparsity of the initial association matrix, which restricts its predictive performance. To address these problems, we propose a matrix completion method based on adaptive neighbourhood similarity and sparse constraints (ANS-SCMC) for predicting potential microbe-disease associations. Adaptive neighbourhood similarity learning performs local manifold structure learning and decomposition simultaneously, dynamically using the decomposition results as information for the next learning iteration. This approach preserves fine local structure information and avoids the influence of weight parameters directly involved in similarity measurement. In addition, the sparse constraint-based matrix completion approach better handles the sparsity of the association matrix. In validation experiments, the proposed algorithm achieved higher predictive performance than several commonly used prediction methods. Furthermore, in the case studies, the top 10 predicted microbes reached an accuracy of 80% for type 1 diabetes and 100% for Crohn's disease.

models for disease prevention and treatment prognosis development, it has extensive research and application value. With the accumulation of biomedical data and the synergistic development of computational technology, the application of machine learning models to predict associations between microbes and diseases has been widely studied. 4 Currently, modelling algorithms used in this field mainly include network-based methods, deep learning-based methods and matrix-based methods. 5
Network-based methods typically infer possible associations by analysing the topology of networks constructed from multiple databases. This approach relies on the structural information between different networks to identify association patterns. 6 For example, Zou et al. 7 proposed a method based on a double random walk (RW) on heterogeneous networks. Yin et al. 8 proposed a label propagation (LP) method based on multi-similarity fusion. Although LP and RW algorithms are efficient and easy to use, they cover limited biological information. 9 Li et al. 10 proposed the KATZBNRA model based on the bipartite network recommendation algorithm and KATZ. Wang et al. 11 proposed a network embedding (NE) method based on heterogeneous networks and global graph feature learning. Among them, the KATZ measure can reconstruct potential associations in large-scale networks, but because it is calculated from GIP kernel similarity, it may be affected by the known associations. 12 On the other hand, the concept of meta-paths used in NE can clearly capture basic higher-order proximity. However, as network information increases, the complexity of training embeddings also increases. 13 1][22][23] These technologies provide new research approaches for microbe-disease association prediction by building complex nonlinear models and mining deep features and relationships in data.
Many scholars have also applied deep learning technologies to microbe-disease association prediction. For example, Lu et al. proposed a method based on autoencoders and graph convolutional networks (GCN) to predict the association between microbes and diseases. 24 Liu D et al. proposed a method based on graph attention networks (GAT). 25 Wang et al. proposed a method for predicting potential microbe-disease associations based on multisource features and deep learning, which used deep sparse autoencoder neural networks (SAE). 14 GCN improves translation invariance on non-matrix structured data, but its flexibility and scalability need to be improved. 26 GAT is effective at aggregating information in graph neural networks, but it has difficulty aggregating high-order neighbourhoods and is sensitive to parameter initialization. 27 In the prediction of microbe-disease associations, SAE can effectively mine important features, compress secondary features, and generate lower-dimensional and sparser abstract features, but it cannot clearly distinguish between the activity and hiddenness of nodes, and the selection of sparsity parameters is also difficult. 28
Matrix-based methods, such as matrix factorization and matrix completion, mainly work by decomposing the original high-dimensional input matrix into two lower-dimensional matrices. These two smaller matrices are updated iteratively by an optimization algorithm so that their product is as close to the original matrix as possible. For example, Liu et al. 29 proposed a non-negative matrix factorization (NMF) based on graph regularization. Yang et al. 30 proposed a multi-similarity bilinear matrix factorization method based on similarity constrained matrix factorization (SC-MF). Xu et al. 31 proposed a new collaborative weighted non-negative matrix factorization method based on collaborative matrix factorization (collaborative-MF). Matrix factorization can discover deeper latent connections, and its space complexity is relatively low. However, MF-based methods have more parameters, so parameter selection is more difficult and model training takes longer. 32 Another matrix-based method is matrix completion. Matrix completion restores a matrix with missing values by decomposing it into two or more matrices and then multiplying these factors to obtain an approximation of the original matrix. For example, Shi et al. proposed a binary matrix completion method. 33 Long et al. 34 proposed a graph regularized non-negative matrix completion method. Factorization-based completion avoids costly singular value decomposition and can be implemented in a distributed environment, but it is a non-convex optimization problem and may therefore converge to a solution that is not globally optimal. 35 The matrix-based methods mentioned above rely on the similarity information between microbes and diseases for prediction.
Moreover, the sparsity issue of the initial association matrix also significantly affects the predictive performance of such methods.
Considering the limitations of the existing matrix completion methods in terms of the reliability of similarity information and the sparsity of the association matrix, we propose a matrix completion method based on adaptive neighbourhood similarity and sparse constraints (ANS-SCMC). We first calculate the Gaussian interaction profile (GIP) kernel similarity of diseases and of microbes from the association matrix. Based on the obtained disease and microbe similarity information, we use weighted K nearest known neighbours (WKNKN) to preprocess the microbe-disease association matrix.
Next, we obtain the adaptive neighbourhood similarity of microbes and diseases using a local learning method based on the Laplacian manifold. Then, using this adaptive neighbourhood similarity, we apply matrix completion with sparse constraints: starting from the known association information, we decompose the preprocessed association matrix into low-rank sub-matrices and iteratively fill in the missing entries of the original association matrix by gradient descent, which yields the final predicted score matrix. To validate the effectiveness of ANS-SCMC, we compared its predictive performance with six other state-of-the-art microbe-disease association (MDA) identification models. The results demonstrated that ANS-SCMC outperformed the other models in terms of prediction accuracy. In the case studies, ANS-SCMC proved to be a reliable and effective model with strong predictive capability for microbe-disease associations.

| Data preparation
We obtained the relevant data from the Human Microbe-Disease Association Database (HMDAD, http://www.cuilab.cn/hmdad). These data are mainly derived from microbe studies based on 16S rRNA sequencing, which only give information at the genus level. The database includes 483 experimentally validated human microbe-disease associations involving 39 different human diseases and 292 microbes. After merging records of the same association that are supported by different evidence, we compiled 450 distinct associations.
In our study, these microbe-disease associations were represented as an adjacency matrix A. If there is an association between a microbe m(i) and a disease d(j), the value of A(m(i), d(j)) is 1; otherwise it is 0. In addition, we defined two variables, $n_m$ and $n_d$, which denote the number of microbe species and the number of disease species involved in the study.
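The construction of A can be sketched as follows. This is a minimal illustration that assumes the associations are available as (microbe, disease) name pairs; the function name `build_adjacency` and the toy records are hypothetical and are not taken from HMDAD.

```python
import numpy as np


def build_adjacency(pairs):
    """Build a binary microbe-by-disease matrix from (microbe, disease) pairs."""
    microbes = sorted({m for m, _ in pairs})
    diseases = sorted({d for _, d in pairs})
    m_index = {m: i for i, m in enumerate(microbes)}
    d_index = {d: j for j, d in enumerate(diseases)}
    A = np.zeros((len(microbes), len(diseases)), dtype=int)
    for m, d in pairs:
        # Repeated records of the same pair collapse into a single 1.
        A[m_index[m], d_index[d]] = 1
    return A, microbes, diseases


# Toy records with hypothetical names; the last one repeats the first.
pairs = [("m1", "d1"), ("m2", "d1"), ("m2", "d2"), ("m1", "d1")]
A, microbes, diseases = build_adjacency(pairs)
n_m, n_d = A.shape  # number of microbes, number of diseases
```

Rows index microbes and columns index diseases, matching the convention A(m(i), d(j)) used in the text.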

| Kernel similarity of Gaussian interaction profiles of microbes
Under the assumption that functionally similar microbes are usually associated with similar diseases and therefore share similar interaction patterns with diseases, we compute microbe similarity from the known microbe-disease association network using the Gaussian interaction profile (GIP) kernel similarity. This process consists of two steps. First, we define the binary vector AM(m(i)) as the interaction profile of microbe m(i), which records whether m(i) is associated with each disease. Then, we calculate the kernel similarity between each pair of microbes based on their interaction profiles. From these pairwise similarities we construct the GIP kernel similarity matrix MF, as shown in Equation (1):

$$MF(m(i), m(j)) = \exp\left(-\gamma_m \lVert AM(m(i)) - AM(m(j)) \rVert^2\right) \tag{1}$$

Here, $\gamma_m$ is the normalized kernel bandwidth, obtained from the new bandwidth parameter $\gamma'_m$ as in Equation (2):

$$\gamma_m = \gamma'_m \Big/ \left(\frac{1}{n_m}\sum_{i=1}^{n_m} \lVert AM(m(i)) \rVert^2\right) \tag{2}$$

Each entry MF(m(i), m(j)) is the GIP kernel similarity between microbes m(i) and m(j).
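The two steps above can be sketched in a few lines of NumPy. The sketch assumes the usual GIP kernel form, exp(-γ‖·‖²), with the bandwidth normalised by the mean squared norm of the interaction profiles; the function name `gip_kernel` and the default value 1.0 for the unnormalised bandwidth are assumptions for illustration, not values stated in the text.

```python
import numpy as np


def gip_kernel(profiles, gamma_prime=1.0):
    """Gaussian interaction profile kernel between the rows of `profiles`.

    gamma_prime is the unnormalised bandwidth (assumed default: 1.0).
    """
    profiles = np.asarray(profiles, dtype=float)
    sq_norms = (profiles ** 2).sum(axis=1)
    mean_sq_norm = sq_norms.mean()
    if mean_sq_norm == 0:
        raise ValueError("all interaction profiles are empty")
    gamma = gamma_prime / mean_sq_norm  # normalised bandwidth
    # Squared Euclidean distance between every pair of profiles.
    sq_dist = sq_norms[:, None] + sq_norms[None, :] - 2.0 * profiles @ profiles.T
    sq_dist = np.maximum(sq_dist, 0.0)  # guard against tiny negative round-off
    return np.exp(-gamma * sq_dist)


# Toy association matrix: 3 microbes (rows) by 3 diseases (columns).
A = np.array([[1, 0, 1],
              [1, 0, 0],
              [0, 1, 0]])
MF = gip_kernel(A)    # microbe similarity, computed from the rows of A
DF = gip_kernel(A.T)  # disease similarity, computed from the columns of A
```

The same function serves both sides: passing the transpose of A yields the disease kernel described in the next section.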

| Gaussian interaction profile kernel similarity for diseases
As described in Section 2.2, it is assumed that functionally similar microbes are usually associated with similar diseases. Therefore, we can calculate the GIP kernel similarity of diseases in the same way as for microbes. We define the binary vector AD(d(i)) as the interaction profile of disease d(i), which records whether d(i) is associated with each microbe. Then, we calculate the kernel similarity between each pair of diseases based on their interaction profiles.
After calculating the similarity between disease pairs, we construct the GIP kernel similarity matrix DF, as shown in Equation (3):

$$DF(d(i), d(j)) = \exp\left(-\gamma_d \lVert AD(d(i)) - AD(d(j)) \rVert^2\right) \tag{3}$$

where the kernel bandwidth parameter $\gamma_d$ is defined in Equation (4).

| Methodology overview
In this study, ANS-SCMC was used to predict possible associations between microbes and diseases. We first downloaded the human microbe-disease associations, calculated the Gaussian interaction profile kernel similarity of diseases and of microbes from them, and used WKNKN to reduce the sparsity of the disease-microbe association matrix. Then the adaptive neighbourhood similarity matrices were obtained from the disease-microbe association matrix. Finally, we used matrix completion with sparse constraints, minimizing the resulting loss function by iterative gradient descent to obtain the final result. Figure 1 depicts the algorithmic flow of ANS-SCMC, which consists of three parts: the association probability matrix with the WKNKN algorithm, adaptive neighbourhood similarity (ANS) and matrix completion with sparse constraints (SCMC).

| Correlation probability matrix with WKNKN algorithm
To characterize the relationship between microbes and diseases, we used the WKNKN algorithm to reduce the sparsity of the disease-microbe association matrix A. Every experimentally confirmed microbe (disease) is associated with at least one disease (microbe), but the disease-microbe association matrix A still contains undiscovered potential interactions. Therefore, we use WKNKN as a preprocessing step to estimate the likelihood of potential interactions between microbes and diseases.
Step 1: For each microbe $m_i$, we use the GIP kernel similarity matrix MF obtained in Section 2.2 to find its K nearest known microbes and use the interaction profiles of these neighbours to infer the interaction likelihood profile of $m_i$. We formulate WKNKN as shown in Equation (5), where $l_1$ to $l_K$ are the K nearest known neighbours of $m_i$ sorted by descending similarity and $w_n^m$ is the weight coefficient.
Step 2: Similarly, for each disease $d_j$, we infer its interaction likelihood profile based on the GIP kernel similarity matrix DF obtained in Section 2.3, as shown in Equation (6), where $d_1$ to $d_K$ are the K nearest known neighbours of $d_j$ sorted by descending similarity and $w_n^d$ is the weight coefficient.
Step 3: Finally, we denote the average of $A_m$ and $A_d$ as Q and fill the blank entries of matrix A with the corresponding values of Q, as shown in Equations (7) and (8). We call the matrix processed by the WKNKN algorithm the filled association matrix and name it $A^W$.
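Steps 1 to 3 can be sketched as below. This follows the commonly used WKNKN formulation, in which the n-th nearest neighbour is weighted by a decay factor raised to the power n-1 times its similarity, and the sum is normalised by the total similarity of the K neighbours. The decay factor `eta` and its default of 0.5 are assumptions (the text only speaks of a weight coefficient), and the function names are hypothetical.

```python
import numpy as np


def _knn_profiles(A, S, K, eta):
    """Weighted average of the K most similar rows' profiles, for every row."""
    out = np.zeros(A.shape, dtype=float)
    for i in range(A.shape[0]):
        order = np.argsort(-S[i], kind="stable")  # descending similarity
        nn = order[order != i][:K]                # K nearest rows, excluding i
        sims = S[i, nn]
        z = sims.sum()                            # normalisation term
        if z <= 0:
            continue
        weights = (eta ** np.arange(len(nn))) * sims
        out[i] = weights @ A[nn] / z
    return out


def wknkn(A, MF, DF, K=6, eta=0.5):
    """Fill the zero entries of A with neighbour-based likelihoods."""
    A = np.asarray(A, dtype=float)
    MF = np.asarray(MF, dtype=float)
    DF = np.asarray(DF, dtype=float)
    A_m = _knn_profiles(A, MF, K, eta)        # Step 1: microbe side
    A_d = _knn_profiles(A.T, DF, K, eta).T    # Step 2: disease side
    Q = (A_m + A_d) / 2.0                     # Step 3: average of both views
    return np.where(A == 0, Q, A)             # known associations are kept


A = np.array([[1, 0], [0, 1], [0, 0]])
MF = np.array([[1.0, 0.8, 0.2], [0.8, 1.0, 0.5], [0.2, 0.5, 1.0]])
DF = np.array([[1.0, 0.5], [0.5, 1.0]])
A_W = wknkn(A, MF, DF, K=1)
```

Because the weights never exceed the normalised similarities, every filled value stays in [0, 1], so the result can be read as an association likelihood.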

| Adaptive neighbourhood similarity
Traditional Laplacian manifold learning mainly reconstructs the local structure of the data manifold by constructing an adjacency matrix, with the aim of enhancing the smoothness of the data in its linear and non-linear space. It rests on the assumption that if two data points are adjacent in the geometric structure of the original space, they should remain similar in the new representation space. However, when constructing such a graph, traditional methods do not fully consider the actual size of the data set, which may introduce many unnecessary connections into the graph and thus fail to reflect the internal geometric structure of the data accurately. To overcome the limitation that the Laplacian graph may produce only trivial solutions, we adopt a method that does not rely on distance, see Equation (9), where $a_i$ and $a_j$ are the i-th and j-th rows of A, $w_{ij}$ is the similarity between microbe i and microbe j, $n_m$ is the number of row vectors in the association matrix, and the regularization term $r_0 w_{ij}^2$ encodes the prior that every data point can be a neighbour of $x_i$ with the same probability $1/n_m$. The regularization parameter $r_0$ controls the adaptive neighbourhood size in the local-structure similarity graph. It can be understood as prior knowledge that helps determine the range of neighbourhood assignment.
Optimization of the solution: The above equation is equivalent to Equation (10), where L is a temporary proxy variable matrix used to approximate $W - D_W$, $D_W$ is the diagonal matrix of W, and $r_1$ and $r_2$ are regularization hyperparameters. W and L are then solved as in Equations (11) and (12). Based on this, we obtain Equations (13) and (14), where E is the identity matrix of size $n_m \times n_m$.
The iterative update stops when the convergence condition is satisfied, at which point we take $W^t$ as the final similarity matrix MS for the microbes.
Similarly, we can obtain the similarity matrix DS for the diseases based on the same method.
In the calculation of the adaptive similarity matrix, we allocate weights between adaptive and optimal neighbours by dynamically calculating the differences in vector data for each column or row of the known correlation matrix, thereby obtaining the similarity matrix of microbe or disease data.Our model can perform local manifold structure learning on the information of the known correlation matrix, adaptively balancing the differences in known correlation information and the minimization of prior information.During the process of learning the data similarity matrix, we continuously adjust the value of weight W to preserve the adaptive local structure of the data.The details of the adaptive similarity steps are shown in Figure 2.
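To make the role of the regularization parameter $r_0$ concrete, the following sketch solves the standard adaptive-neighbours problem row by row: minimise the sum over j of $\lVert a_i - a_j \rVert^2 w_{ij} + r_0 w_{ij}^2$ subject to $w_{ij} \ge 0$ and each row summing to 1. Under these assumed constraints the problem reduces to a Euclidean projection onto the probability simplex. This is not the alternating update of W and L with $r_1$ and $r_2$ described above, and excluding self-similarity is a further assumption; the function names are hypothetical.

```python
import numpy as np


def project_to_simplex(v):
    """Euclidean projection of v onto {w : w >= 0, sum(w) = 1}."""
    u = np.sort(v)[::-1]                      # sort in descending order
    css = np.cumsum(u) - 1.0
    k = np.arange(1, len(v) + 1)
    rho = np.nonzero(u - css / k > 0)[0][-1]  # last index with a positive gap
    theta = css[rho] / (rho + 1)
    return np.maximum(v - theta, 0.0)


def adaptive_neighbour_similarity(A, r0=1.0):
    """Row-stochastic similarity learned from the rows of A."""
    A = np.asarray(A, dtype=float)
    n = A.shape[0]
    sq = (A ** 2).sum(axis=1)
    dist = np.maximum(sq[:, None] + sq[None, :] - 2.0 * A @ A.T, 0.0)
    W = np.zeros((n, n))
    for i in range(n):
        others = np.arange(n) != i            # assumption: no self-similarity
        # min_w sum_j dist_ij w_j + r0 w_j^2 equals projecting -dist_i / (2 r0).
        W[i, others] = project_to_simplex(-dist[i, others] / (2.0 * r0))
    return W


A = np.array([[1, 0, 1], [1, 0, 0], [0, 1, 0]])
W_sharp = adaptive_neighbour_similarity(A, r0=0.1)   # few neighbours per row
W_flat = adaptive_neighbour_similarity(A, r0=1e6)    # near-uniform weights
```

A small $r_0$ concentrates each row on its closest neighbours, while a large $r_0$ pushes the weights towards the uniform prior, which is the sense in which $r_0$ sets the neighbourhood size.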

| Matrix completion with sparse constraints
SCMC is a matrix completion method. Its core idea is to approximate the target matrix iteratively by the inner product of two sub-matrices. Specifically, let $A \in R^{n_m \times n_d}$ be a microbe-disease association matrix containing missing information, where $n_m$ and $n_d$ are the number of microbes and the number of diseases, respectively. We decompose the preprocessed matrix into two low-dimensional sub-matrices, the microbe feature matrix $M \in R^{n_m \times r}$ and the disease feature matrix $D \in R^{n_d \times r}$, and approximate $A^W$ by the inner product of these two matrices, where r is the dimension of the latent space. In this process, the disease-microbe association information is mapped to a common low-dimensional space, which we use to predict the likelihood of disease-microbe associations. During matrix completion, we continuously narrow the gap between the approximate matrix and the target matrix through gradient descent to optimize the sub-matrices. To prevent overfitting, SCMC adds a penalty term on the complexity of the sub-matrices. In addition, SCMC retains the original disease-microbe association information and continuously feeds the predicted association information back into the model to adjust and update the sub-matrices. Therefore, for a specific microbe $m_i$ and disease $d_j$, we express the association probability as Equation (15), or in matrix form, as in Equation (16).
FIGURE 2 The details of the steps for adaptive neighbourhood similarity.
Here, $A^W_{i,j}$ is the element in row i and column j of $A^W$, $p_{i,j}$ is the association probability between microbe $m_i$ and disease $d_j$, and $m_i$ and $d_j$ also denote the i-th row of the microbe sub-matrix M and the j-th row of the disease sub-matrix D. The conditional probability expression gives the probability of the observed entry $A^W_{i,j}$ for a given microbe $m_i$ and disease $d_j$. In addition, we added initial weights $w_{i,j}$ for microbe $m_i$ and disease $d_j$ to improve the model's fitting ability and performance.
We derived $p(A^W \mid M, D)$ through Bayesian reasoning, as shown in Equation (17). The loss function is obtained as in Equation (18).
To improve the accuracy of the prediction method, we introduced a sparse constraint coefficient R to adjust the sparsity of M and D. In addition, considering that similar microbes are likely to be associated with similar diseases, we further expanded the loss function. Specifically, we used the microbe similarity matrix $MF \in R^{n_m \times n_m}$, where each entry $MF_{i,j}$ is the similarity between microbes $m_i$ and $m_j$. Similarly, we used the disease similarity matrix $DF \in R^{n_d \times n_d}$, where each entry $DF_{i,j}$ is the similarity between diseases $d_i$ and $d_j$. We model the association between similar diseases and similar microbes by reducing the distance between the feature vectors of similar microbes, as shown in Equation (19). Similarly, the distance between the feature vectors of similar diseases is minimized as in Equation (20). We add the two regularization terms of Equations (19) and (20) to the loss function of Equation (18), weighted by two additional adjustable parameters. Finally, we transform the loss function into Equation (21). SCMC uses an iterative gradient descent method (AdaGrad) to optimize the model. During the iteration process, we write the partial derivatives of the loss function as Equations (22) and (23) to guide the optimization, where BM is the partial derivative of LOSS with respect to M, BD is the partial derivative of LOSS with respect to D, and ⊙ denotes the Hadamard product. The sub-matrices M and D are updated according to Equations (24) and (25). The learning rate is a key parameter in the iterative optimization process; based on experience, we set it to a fixed value of 0.1 to simplify the calculation. The superscript n denotes the current iteration number. The update of the sub-matrices M and D continues until the end condition $\max(\Delta M, \Delta D) < 10^{-5}$ is reached; $\lVert BM \rVert_F$ and $\lVert BD \rVert_F$ are the Frobenius norms of BM and BD, respectively, as shown in Equations (28) and (29).
Based on the above description, the process of SCMC can be summarized as follows.
Step 1: The association probability matrix $A^W$ is used to construct the initial component matrices $M^0 \in R^{n_m \times r}$ and $D^0 \in R^{n_d \times r}$.
Step 2: According to Equation (16), the initial probability matrix $P^0$ is calculated and the loss function is constructed, as shown in Equation (18).
Step 3: Combined with the two weight matrices derived by the ANS algorithm, the loss function of Equation (18) is rewritten as Equation (21) based on Equations (19) and (20).
Step 4: Using the loss function of Equation (21), the partial derivatives with respect to M and D are calculated based on Equations (22) and (23), and the matrices M and D are updated based on Equations (24) and (25).
Step 5: The above steps are repeated until $\max(\Delta M, \Delta D) < 10^{-5}$.
Step 6: Finally, according to Equation (17), the association prediction matrix $P^{n+1}$ is calculated using $M^{n+1}$ and $D^{n+1}$.
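The loop summarized above can be illustrated with a generic logistic matrix factorization that uses graph-Laplacian regularization and AdaGrad updates. This is a sketch under stated assumptions rather than the loss of Equation (21): the association probability is taken to be a sigmoid of the inner product of the two feature vectors, the R-weighted penalty is implemented as a plain squared Frobenius norm, the entry weights $w_{i,j}$ are omitted, and the names `scmc_sketch`, `alpha` and `beta` are hypothetical.

```python
import numpy as np


def laplacian(S):
    """Unnormalised graph Laplacian of a (symmetrised) similarity matrix."""
    S = (S + S.T) / 2.0
    return np.diag(S.sum(axis=1)) - S


def sigmoid(X):
    return 1.0 / (1.0 + np.exp(-X))


def scmc_sketch(A_w, MS, DS, r=2, R=0.01, alpha=0.1, beta=0.1,
                lr=0.1, tol=1e-5, max_iter=2000, seed=0):
    """Factorise A_w ~ sigmoid(M D^T) with similarity regularisation."""
    rng = np.random.default_rng(seed)
    A_w = np.asarray(A_w, dtype=float)
    n_m, n_d = A_w.shape
    M = 0.1 * rng.standard_normal((n_m, r))   # microbe feature matrix
    D = 0.1 * rng.standard_normal((n_d, r))   # disease feature matrix
    L_m = laplacian(np.asarray(MS, dtype=float))
    L_d = laplacian(np.asarray(DS, dtype=float))
    G_M, G_D = np.zeros_like(M), np.zeros_like(D)  # AdaGrad accumulators
    for _ in range(max_iter):
        E = sigmoid(M @ D.T) - A_w            # gradient of the cross-entropy
        B_M = E @ D + R * M + alpha * L_m @ M
        B_D = E.T @ M + R * D + beta * L_d @ D
        G_M += B_M ** 2
        G_D += B_D ** 2
        step_M = lr * B_M / (np.sqrt(G_M) + 1e-8)
        step_D = lr * B_D / (np.sqrt(G_D) + 1e-8)
        M -= step_M
        D -= step_D
        # Stop once the largest parameter change falls below the tolerance.
        if max(np.abs(step_M).max(), np.abs(step_D).max()) < tol:
            break
    return sigmoid(M @ D.T)


A_w = np.array([[1, 1, 0], [1, 1, 0], [0, 0, 1], [0, 0, 1]], dtype=float)
MS = (A_w @ A_w.T > 0).astype(float)   # toy similarities, for illustration only
DS = (A_w.T @ A_w > 0).astype(float)
P = scmc_sketch(A_w, MS, DS)
```

The Laplacian terms pull the feature vectors of similar microbes (and of similar diseases) towards each other, which is the effect the similarity regularization in the text is meant to achieve.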

| Assessment indicators
Leave-one-out cross-validation (LOOCV) and 5-fold cross-validation (CV) are used to evaluate the performance of our method and other state-of-the-art microbe-disease prediction methods. In LOOCV, each known association between microbes and diseases is selected in turn as the test sample, while the other known associations serve as training samples. In 5-fold CV, known associations are considered positive samples, and unobserved associations are considered negative samples. All positive samples are randomly divided into five groups, with four groups placed in the training set and the remaining group used for testing. In each round, we randomly select as many negative samples as there are positive samples in the four training groups, and the remaining negative samples are used for testing.
By sorting the samples using our method's scores and different thresholds, the ROC curve can be plotted and the area under the ROC curve (AUC) can be obtained.
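The AUC can also be computed directly from the ranked scores, without tracing the ROC curve, because it equals the probability that a randomly chosen positive sample is scored above a randomly chosen negative one (the Mann-Whitney rank statistic). The sketch below uses that identity; the function name `auc_from_scores` is hypothetical, and tied scores receive average ranks.

```python
import numpy as np


def auc_from_scores(scores, labels):
    """Area under the ROC curve via the rank-sum identity."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    order = np.argsort(scores, kind="mergesort")
    sorted_scores = scores[order]
    ranks = np.empty(len(scores))
    i = 0
    while i < len(scores):
        j = i
        # Extend j over a run of tied scores.
        while j + 1 < len(scores) and sorted_scores[j + 1] == sorted_scores[i]:
            j += 1
        ranks[order[i:j + 1]] = (i + j) / 2.0 + 1.0  # average 1-based rank
        i = j + 1
    n_pos = int(labels.sum())
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one positive and one negative sample")
    return (ranks[labels].sum() - n_pos * (n_pos + 1) / 2.0) / (n_pos * n_neg)


# Toy test fold: two held-out positives and two negatives.
auc = auc_from_scores([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0])
```

In a cross-validation round, `scores` would hold the predicted values of the test pairs and `labels` would mark which of them are held-out known associations.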

| Optimal parameter selection
To evaluate the impact of the parameters in ANS-SCMC, we first analyse the parameter K of the WKNKN function, which controls the extraction of microbial features. We vary K within the range [1, 10] with a step size of 1. The experimental results demonstrate that when K is set to 6, ANS-SCMC achieves the best performance in both five-fold CV and LOOCV, as shown in Figures 3 and 4. Therefore, we ultimately set the parameter K in WKNKN to 6.
Next, we analysed the hyperparameters $r_1$ and $r_2$ of the adaptive neighbourhood similarity method ANS, which are used to obtain the microbe and disease weight matrices. Their ranges were set to [1, 10] with a step size of 1. The experimental results show that when the parameters

| Algorithm comparison
We evaluated the MDA prediction performance of the proposed ANS-SCMC method on the HMDAD dataset, using LOOCV and five-fold CV, and compared it with six other MDA identification methods. These methods are MNNMDA, 36 NTSHMDA, 37 LRLSHMDA, 38 KATZHMDA, 39 BiRWHMDA 40 and BRWMDA. 41 The experimental results are shown in Figure 10. We found that ANS-SCMC performs very well on HMDAD, with the highest AUC values of 0.9789 on LOOCV and 0.9758 on five-fold CV. Overall, ANS-SCMC shows better predictive performance than the other six methods (Figure 11).

| Case
In our study, we generated a training set from the known microbe-disease association dataset to evaluate the effectiveness of the proposed ANS-SCMC method in predicting unknown microbe-disease associations. With this method, we generated an association prediction score for each unknown microbe-disease pair and ranked these scores in descending order. Our goal was then to identify new microbes associated with type 1 diabetes and Crohn's disease (CD).
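The ranking step of the case study can be sketched as follows, assuming a predicted score matrix P with microbes as rows and diseases as columns and the binary matrix A of known associations. The function name `top_k_candidates` and the toy values are hypothetical.

```python
import numpy as np


def top_k_candidates(P, A, disease_idx, microbe_names, k=10):
    """Rank microbes with no known association to the disease by score."""
    scores = np.asarray(P, dtype=float)[:, disease_idx].copy()
    known = np.asarray(A)[:, disease_idx] == 1
    scores[known] = -np.inf                    # exclude known associations
    order = np.argsort(-scores, kind="stable")  # descending score
    picked = [i for i in order if np.isfinite(scores[i])][:k]
    return [(microbe_names[i], float(scores[i])) for i in picked]


# Toy example: one disease, four microbes, the first already known.
P = np.array([[0.9], [0.2], [0.7], [0.4]])
A = np.array([[1], [0], [0], [0]])
names = ["m1", "m2", "m3", "m4"]
ranking = top_k_candidates(P, A, disease_idx=0, microbe_names=names, k=2)
```

The returned list can then be checked entry by entry against the literature, which is how the validated fraction of the top 10 is obtained in the following sections.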

| Type 1 diabetes
Type 1 diabetes is a chronic autoimmune disease that causes the pancreas to produce very little or no insulin. 42 Insulin is a hormone that helps blood sugar enter body cells, where it is converted into energy.
Without enough insulin, blood sugar is not absorbed by cells but accumulates in the bloodstream, resulting in high blood sugar and related symptoms and complications. 43 of all diabetes cases. There are currently no known preventive measures. 44 In our study, the ANS-SCMC method was used to explore new microbes related to type 1 diabetes. In the HMDAD database, we predicted the top 10 microbes that may be related to type 1 diabetes. Among these 10 candidate microbes, 8 have been validated in the literature, as shown in Table 1. In addition, our study suggests that Whipple barrier organism and Oxalobacter may be associated with type 1 diabetes.

| Crohn's disease (CD)
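Crohn's disease (CD) is an inflammatory bowel disease that may affect any part of the digestive tract. Its symptoms include abdominal pain, diarrhoea, fever, bloating, and weight loss. 45 The exact cause of Crohn's disease is not yet clear, but it may be related to genetics, immune response, environmental factors, and gut microbiota. Although there is currently no cure for Crohn's disease, symptoms can be managed through medication, surgery, or lifestyle adjustments. Crohn's disease may cause various complications, such as intestinal obstruction, intestinal fistula, intestinal perforation, malnutrition, anaemia, and other related inflammations. 46 In this study, we applied the ANS-SCMC method to identify microbes that may be associated with Crohn's disease. In the HMDAD database, we predicted the top 10 microbes that may be associated with Crohn's disease, all of which were validated by the database or existing literature, as shown in Table 2.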

| DISCUSSION AND CONCLUSION
This study proposes the ANS-SCMC method, aiming to discover new connections between microbes and complex human diseases. The ANS-SCMC method first calculates the Gaussian interaction profile kernel similarity for microbes and for diseases. It then utilizes the WKNKN algorithm to extract features and compute pairwise associations between microbes and diseases. Subsequently, it constructs adaptive neighbourhood similarity for both microbes and diseases. Finally, the SCMC method is employed to decompose the preprocessed association matrix into two low-rank matrices. The missing values in the original matrix are computed by iteratively updating these matrices using gradient descent, so that the product of the two sub-matrices continuously approaches the original matrix and yields the final microbe-disease association prediction scores. We compared ANS-SCMC with six state-of-the-art MDA identification models on the HMDAD database and achieved the highest AUC values in both five-fold CV and LOOCV. In addition, we used ANS-SCMC in case studies to predict microbes related to type 1 diabetes and Crohn's disease. Comparing the predicted top 10 microbes with the database and literature evidence, the proportion of verified associations was 80% and 100%, respectively. The results suggest that Whipple barrier organism and Oxalobacter may be closely related to type 1 diabetes, a finding that needs further biological experiments to verify.
In the future, we plan to design more accurate negative MDA screening methods by combining the biological characteristics of microbes, diseases, and MDA networks.We will also develop new deep learning models to improve the performance of MDA classification based on reliable negative MDA samples.We hope that the proposed ANS-SCMC method can help identify disease-related mi-

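The kernel bandwidth parameter $\gamma_d$ is calculated by normalizing a new bandwidth parameter $\gamma'_d$ by the average number of associations per disease, as shown in Equation (4):

$$\gamma_d = \gamma'_d \Big/ \left(\frac{1}{n_d}\sum_{i=1}^{n_d} \lVert AD(d(i)) \rVert^2\right) \tag{4}$$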

r 1 and r 2 Figure 9 ;
When the two similarity regularization parameters of SCMC are set to 0.1 and 0.3, respectively, LOOCV achieves the best performance, as shown in Figure 10. Based on these findings and the fact that the prediction accuracy of the model in LOOCV gradually stabilizes and saturates for parameter values in [0.1, 0.3], we ultimately set both parameters to 0.1. In summary, we set the parameter R in SCMC to 0.01 and the two similarity regularization parameters to 0.1 each.

FIGURE 4 Effect of different K in WKNKN under LOOCV.
FIGURE 5 Effect of different $r_1$ and $r_2$ in ANS under 5-fold CV.
FIGURE 6 Effect of different $r_1$ and $r_2$ in ANS under LOOCV.
FIGURE 7 Effect of different R in SCMC under 5-fold CV.

FIGURE 8 Effect of different R in SCMC under LOOCV.
FIGURE 9 Effect of different values of the two similarity regularization parameters in SCMC under 5-fold CV.
Crohn's disease is not yet clear, but it may be related to genetics, immune response, environmental factors, and gut microbiota. Although there is currently no cure for Crohn's disease, symptoms can be managed through medication, surgery, or lifestyle adjustments. Crohn's disease may cause various complications, such as intestinal obstruction, intestinal fistula, intestinal perforation, malnutrition, anaemia, and other related inflammations. 46

FIGURE 10 Effect of different values of the two similarity regularization parameters in SCMC under LOOCV.
FIGURE 11 Results of ANS-SCMC comparison experiments.
microbes and diseases. It then utilizes the WKNKN algorithm to extract features and compute pairwise associations between microbes and diseases. Subsequently, it constructs adaptive neighbourhood
Type 1 diabetes was previously known as insulin-dependent diabetes or juvenile diabetes, but it can actually occur at any age, although it only accounts for
TABLE 1 Predicted top 10 microbes associated with type 1 diabetes by ANS-SCMC.
TABLE 2 Predicted top 10 microbes associated with Crohn's disease by ANS-SCMC.